Supplementary Material for "Variational Policy Gradient Method for Reinforcement Learning with General Utilities": A Related Work

Neural Information Processing Systems

We provide a more extensive discussion of the context of this work. Firstly, when closed-form expressions for the optimizer of a function are unavailable, solving optimization problems requires iterative schemes such as gradient ascent [31]. Their convergence to global extrema is predicated on concavity and on the tractability of computing ascent directions. When the objective takes the form of an expected value of a function parameterized by a random variable, stochastic approximations are required [36, 24]. The policy gradient (PG) theorem mentioned above gives a specific form for obtaining ascent directions with respect to a parameterized family of stationary policies via trajectories in a Markov decision process, when the objective is the expected cumulative return [44], which gives rise to the REINFORCE algorithm.
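The REINFORCE estimator referenced here can be made concrete with a short sketch. This is a minimal illustrative implementation for a tabular softmax policy; the function name and episode format are our own choices, not code from any of the papers:

```python
import numpy as np

def reinforce_gradient(theta, trajectories, gamma=0.99):
    """REINFORCE: grad J(theta) ~ E[ sum_t grad log pi(a_t|s_t) * G_t ].

    theta: (S, A) logits of a tabular softmax policy.
    trajectories: list of episodes, each a list of (state, action, reward).
    """
    grad = np.zeros_like(theta)
    for episode in trajectories:
        # Discounted return-to-go G_t for every step t.
        G, returns = 0.0, []
        for (_, _, r) in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (s, a, _), G_t in zip(episode, returns):
            pi_s = np.exp(theta[s] - theta[s].max())
            pi_s /= pi_s.sum()
            # grad log pi(a|s) for a softmax policy is e_a - pi(.|s).
            score = -pi_s
            score[a] += 1.0
            grad[s] += score * G_t
    return grad / len(trajectories)
```

An unbiased ascent direction is then obtained by averaging over sampled trajectories, exactly the stochastic-approximation regime described above.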


Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Neural Information Processing Systems

In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases.
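For intuition, the state-action occupancy measure and one concave utility of it (entropy, a standard example of a "general utility" that covers pure exploration) can be computed exactly in the tabular case. This is a sketch under our own naming and conventions, not the paper's implementation:

```python
import numpy as np

def occupancy_measure(P, pi, mu0, gamma=0.9):
    """Discounted state-action occupancy lambda(s, a) of policy pi.

    P: (S, A, S) transition kernel, pi: (S, A) policy, mu0: (S,) initial dist.
    Solves d = (1 - gamma) * mu0 + gamma * P_pi^T d for the state occupancy d,
    then lambda(s, a) = d(s) * pi(a|s).
    """
    S, A, _ = P.shape
    P_pi = np.einsum('sap,sa->sp', P, pi)      # state-to-state kernel under pi
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu0)
    return d[:, None] * pi                     # shape (S, A), sums to 1

def entropy_utility(lam, eps=1e-12):
    """One example of a concave 'general utility': occupancy entropy."""
    return -np.sum(lam * np.log(lam + eps))
```

Standard RL is recovered as the linear utility lambda . r for a reward vector r; concave nonlinear utilities such as the entropy above go beyond cumulative return.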


Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities

Neural Information Processing Systems

We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints. The objective and constraints are described by general utilities, i.e., nonlinear functions of the long-term state-action occupancy measure, which encompass broader decision-making goals such as risk, exploration, or imitation. The exponential growth of the state-action space size with the number of agents presents challenges for global observability, further exacerbated by the global coupling arising from agents' safety constraints. To tackle this issue, we propose a primal-dual method utilizing shadow reward and $\kappa$-hop neighbor truncation under a form of correlation decay property, where $\kappa$ is the communication radius. In the exact setting, our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$. In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP.
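The core primal-dual update can be conveyed with a generic iteration on a Lagrangian; the shadow-reward estimation and $\kappa$-hop truncation described in the abstract are omitted, and the names, step sizes, and single-constraint setup here are our own illustrative choices:

```python
def primal_dual_step(theta, lam, grad_f, grad_g, g, eta=0.1, eta_dual=0.1):
    """One primal-dual iteration on L(theta, lam) = f(theta) + lam * g(theta)
    for a safety constraint g(theta) >= 0; lam is the dual 'price' of safety.

    grad_f, grad_g: callables returning gradients at theta; g: constraint value.
    """
    # Gradient ascent on the Lagrangian in the primal (policy) variable.
    theta = theta + eta * (grad_f(theta) + lam * grad_g(theta))
    # Projected gradient descent in the dual variable (lam stays nonnegative).
    lam = max(0.0, lam - eta_dual * g(theta))
    return theta, lam
```

When the constraint is violated (g < 0), the dual price grows and penalizes unsafe policies; when it is slack, the price decays toward zero.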


Robust Reinforcement Learning with General Utility

Neural Information Processing Systems

The reinforcement learning (RL) problem with general utility is a powerful decision-making framework that covers standard RL with cumulative cost, exploration problems, and learning from demonstrations. Existing works on RL with general utility do not consider robustness under environmental perturbation, which is important for deploying RL systems in real-world environments that differ from the training environment. To train a robust policy, we propose a robust RL framework with general utility, which subsumes many existing RL frameworks, including standard RL, robust RL, RL with general utility, constrained RL, robust constrained RL, pure exploration, and robust entropy-regularized RL. We then focus on popular convex utility functions, with which our proposed learning framework becomes a challenging nonconvex-nonconcave minimax optimization problem; we design a two-phase stochastic policy-gradient-type algorithm and establish its sample complexity for gradient convergence. Furthermore, for convex utility on a widely used polyhedral ambiguity set, we design an algorithm and obtain its convergence rate to a global optimal solution.
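The minimax structure can be illustrated with plain gradient descent-ascent on a toy concave-convex saddle objective; the objective, step size, and iteration count here are our own illustrative choices, and the paper's two-phase algorithm for the nonconvex-nonconcave case is considerably more involved:

```python
def gda(x, y, eta=0.1, iters=500):
    """Gradient descent-ascent on f(x, y) = -x^2 + 2*x*y + y^2:
    ascent in x (the 'policy'), descent in y (the adversarial perturbation).
    f is concave in x and convex in y; the unique saddle point is (0, 0).
    """
    for _ in range(iters):
        gx = -2.0 * x + 2.0 * y   # df/dx
        gy = 2.0 * x + 2.0 * y    # df/dy
        x, y = x + eta * gx, y - eta * gy
    return x, y
```

In the robust-RL reading, y parameterizes the environment perturbation within the ambiguity set and x the policy; the concave-convex case converges for simultaneous updates, whereas the general nonconvex-nonconcave case motivates the paper's two-phase scheme.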